2025-05-29-12-07
Make Planning Research Rigorous Again!
Abstract
arXiv:2505.21674v1 Announce Type: new Abstract: In over sixty years since its inception, the field of planning has made significant contributions to both the theory and practice of building planning software that can solve a never-before-seen planning problem. This was done through established practices of rigorous design and evaluation of planning systems. It is our position that this rigor should be applied to the current trend of work on planning with large language models. One way to do so is by correctly incorporating the insights, tools, and data from the automated planning community into the design and evaluation of LLM-based planners. The experience and expertise of the planning community are not just important from a historical perspective; the lessons learned could play a crucial role in accelerating the development of LLM-based planners. This position is particularly important in light of the abundance of recent works that replicate and propagate the same pitfalls that the planning community has encountered and learned from. We believe that avoiding such known pitfalls will contribute greatly to the progress in building LLM-based planners and to planning in general.
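Summary
In the more than sixty years since the field of planning began, it has made major contributions to the theory and practice of building planning software that can solve entirely new planning problems. This achievement stems from established practices of rigorous design and evaluation of planning systems. We argue that the current wave of research on planning with large language models needs the same rigor. One way to achieve it is to correctly incorporate the insights, tools, and data of the automated planning field into the design and evaluation of LLM-based planners. The experience and expertise of the planning community are not only of historical significance; the lessons it has accumulated can play a key role in accelerating the development of LLM planners. This position is especially important given that much recent work repeats the same pitfalls the planning field once encountered and overcame. We believe that avoiding these known pitfalls will greatly advance LLM-based planners and have a far-reaching effect on the planning field as a whole.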
Herd Behavior: Investigating Peer Influence in LLM-based Multi-Agent Systems
Abstract
arXiv:2505.21588v1 Announce Type: new Abstract: Recent advancements in Large Language Models (LLMs) have enabled the emergence of multi-agent systems where LLMs interact, collaborate, and make decisions in shared environments. While individual model behavior has been extensively studied, the dynamics of peer influence in such systems remain underexplored. In this paper, we investigate herd behavior, the tendency of agents to align their outputs with those of their peers, within LLM-based multi-agent interactions. We present a series of controlled experiments that reveal how herd behaviors are shaped by multiple factors. First, we show that the gap between self-confidence and perceived confidence in peers significantly impacts an agent's likelihood to conform. Second, we find that the format in which peer information is presented plays a critical role in modulating the strength of herd behavior. Finally, we demonstrate that the degree of herd behavior can be systematically controlled, and that appropriately calibrated herd tendencies can enhance collaborative outcomes. These findings offer new insights into the social dynamics of LLM-based systems and open pathways for designing more effective and adaptive multi-agent collaboration frameworks.
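Summary
Recent progress in large language models (LLMs) has given rise to multi-agent systems in which LLMs interact, collaborate, and make decisions in shared environments. Although the behavior of individual models has been studied extensively, the dynamics of peer influence in such systems remain underexplored. This paper studies herd behavior in LLM-based multi-agent interaction, that is, the tendency of agents to align their outputs with those of their peers. Through a series of controlled experiments, we show how herd behavior is shaped by several factors. First, the gap between an agent's self-confidence and its perceived confidence in peers significantly affects its probability of conforming. Second, the format in which peer information is presented plays a key role in modulating the strength of herd behavior. Finally, we show that the degree of herding can be systematically controlled and that appropriately calibrated herd tendencies can improve collaborative outcomes. These findings offer new insight into the social dynamics of LLM-based systems and open a path toward designing more efficient, adaptive multi-agent collaboration frameworks.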
Incorporating LLMs for Large-Scale Urban Complex Mobility Simulation
Abstract
arXiv:2505.21880v1 Announce Type: new Abstract: This study presents an innovative approach to urban mobility simulation by integrating a Large Language Model (LLM) with Agent-Based Modeling (ABM). Unlike traditional rule-based ABM, the proposed framework leverages LLM to enhance agent diversity and realism by generating synthetic population profiles, allocating routine and occasional locations, and simulating personalized routes. Using real-world data, the simulation models individual behaviors and large-scale mobility patterns in Taipei City. Key insights, such as route heat maps and mode-specific indicators, provide urban planners with actionable information for policy-making. Future work focuses on establishing robust validation frameworks to ensure accuracy and reliability in urban planning applications.
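Summary
This study proposes an innovative approach to urban mobility simulation that combines a large language model (LLM) with agent-based modeling (ABM). Unlike traditional rule-based ABM, the framework uses the LLM to generate synthetic population profiles, allocate routine and occasional activity locations, and simulate personalized routes, thereby increasing agent diversity and realism. Simulation experiments based on real-world data from Taipei City model both individual behavior and large-scale mobility patterns. Key findings, such as route heat maps and mode-specific indicators, give urban planners actionable input for decision-making. Future work will focus on building robust validation frameworks to ensure accuracy and reliability in urban planning applications.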
StreamLink: Large-Language-Model Driven Distributed Data Engineering System
Abstract
arXiv:2505.21575v1 Announce Type: new Abstract: Large Language Models (LLMs) have shown remarkable proficiency in natural language understanding (NLU), opening doors for innovative applications. We introduce StreamLink - an LLM-driven distributed data system designed to improve the efficiency and accessibility of data engineering tasks. We build StreamLink on top of distributed frameworks such as Apache Spark and Hadoop to handle large data at scale. One of the important design philosophies of StreamLink is to respect user data privacy by utilizing local fine-tuned LLMs instead of a public AI service like ChatGPT. With help from domain-adapted LLMs, we can improve our system's understanding of natural language queries from users in various scenarios and simplify the procedure of generating database queries, such as Structured Query Language (SQL) queries, for information processing. We also incorporate LLM-based syntax and security checkers to guarantee the reliability and safety of each generated query. StreamLink illustrates the potential of merging generative LLMs with distributed data processing for comprehensive and user-centric data engineering. With this architecture, users can interact with complex database systems at different scales in a user-friendly and security-ensured manner: SQL generation improves execution accuracy by over 10% compared to baseline methods, and users can find the items of greatest interest among hundreds of millions of items within a few seconds using natural language.
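Summary
Large language models (LLMs) have shown outstanding ability in natural language understanding (NLU), opening the way for innovative applications. We present StreamLink, an LLM-based distributed data system that aims to improve the efficiency and accessibility of data engineering tasks. The system is built on distributed frameworks such as Apache Spark and Hadoop to support large-scale data processing. One of StreamLink's key design principles is to protect user data privacy by using locally fine-tuned LLMs rather than public AI services such as ChatGPT. With domain-adapted LLMs, we strengthen the system's understanding of users' natural language queries across varied scenarios and simplify the generation of database queries, such as Structured Query Language (SQL) queries, for information processing. The system also integrates LLM-based syntax and security checkers to ensure the reliability and safety of every generated query. StreamLink shows the potential of combining generative LLMs with distributed data processing to deliver comprehensive, user-centric data engineering. With this architecture, users can interact with complex database systems of different scales in a friendly and secure way: compared with baseline methods, SQL generation execution accuracy improves by more than 10%, and users can locate the items they care most about among hundreds of millions of records within seconds using natural language.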
Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation
Abstract
arXiv:2505.21784v1 Announce Type: new Abstract: Safety reasoning is a recent paradigm where LLMs reason over safety policies before generating responses, thereby mitigating limitations in existing safety measures such as over-refusal and jailbreak vulnerabilities. However, implementing this paradigm is challenging due to the resource-intensive process of creating high-quality policy-embedded chain-of-thought (CoT) datasets while ensuring reasoning remains accurate and free from hallucinations or policy conflicts. To tackle this, we propose AIDSAFE: Agentic Iterative Deliberation for Safety Reasoning, a novel data generation recipe that leverages multi-agent deliberation to iteratively expand reasoning on safety policies. A data refiner stage in AIDSAFE ensures high-quality outputs by eliminating repetitive, redundant, and deceptive thoughts. AIDSAFE-generated CoTs provide a strong foundation for supervised fine-tuning (SFT)-based safety training. Additionally, to address the need of preference data in alignment stages, such as DPO training, we introduce a supplemental recipe that uses belief augmentation to create distinct selected and rejected CoT samples. Our evaluations demonstrate that AIDSAFE-generated CoTs achieve superior policy adherence and reasoning quality. Consequently, we show that fine-tuning open-source LLMs on these CoTs can significantly improve safety generalization and jailbreak robustness while maintaining acceptable utility and over-refusal accuracy. AIDSAFE-generated CoT datasets can be found here: https://huggingface.co/datasets/AmazonScience/AIDSAFE
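Summary
Safety reasoning is an emerging paradigm in which large language models (LLMs) reason over safety policies before generating a response, thereby mitigating limitations of existing safety measures such as over-refusal and jailbreak vulnerabilities. Implementing this paradigm is challenging, however, because creating high-quality policy-embedded chain-of-thought (CoT) datasets is resource-intensive and the reasoning must remain accurate and free of hallucinations or policy conflicts. To address this, we propose AIDSAFE (Agentic Iterative Deliberation for Safety Reasoning), a new data generation recipe that uses multi-agent deliberation to iteratively expand reasoning over safety policies. A data refiner stage in AIDSAFE ensures output quality by removing repetitive, redundant, and deceptive thoughts. AIDSAFE-generated CoTs provide a solid foundation for safety training based on supervised fine-tuning (SFT). In addition, to meet the need for preference data in alignment stages such as DPO training, we introduce a supplementary recipe that uses belief augmentation to create distinct selected and rejected CoT samples. Evaluations show that AIDSAFE-generated CoTs achieve superior policy adherence and reasoning quality. Experiments show that fine-tuning open-source LLMs on these CoTs significantly improves safety generalization and jailbreak robustness while maintaining acceptable utility and over-refusal accuracy. The AIDSAFE-generated CoT datasets are available at https://huggingface.co/datasets/AmazonScience/AIDSAFE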
From Reasoning to Learning: A Survey on Hypothesis Discovery and Rule Learning with Large Language Models
Abstract
arXiv:2505.21935v1 Announce Type: new Abstract: Since the advent of Large Language Models (LLMs), efforts have largely focused on improving their instruction-following and deductive reasoning abilities, leaving open the question of whether these models can truly discover new knowledge. In pursuit of artificial general intelligence (AGI), there is a growing need for models that not only execute commands or retrieve information but also learn, reason, and generate new knowledge by formulating novel hypotheses and theories that deepen our understanding of the world. Guided by Peirce's framework of abduction, deduction, and induction, this survey offers a structured lens to examine LLM-based hypothesis discovery. We synthesize existing work in hypothesis generation, application, and validation, identifying both key achievements and critical gaps. By unifying these threads, we illuminate how LLMs might evolve from mere "information executors" into engines of genuine innovation, potentially transforming research, science, and real-world problem solving.
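Summary
Since large language models (LLMs) appeared, research has focused mostly on improving their instruction-following and deductive reasoning abilities, while the question of whether these models can truly discover new knowledge remains open. In the pursuit of artificial general intelligence (AGI), there is a growing need for models that can not only execute instructions or retrieve information but also learn, reason, and generate new knowledge by proposing new hypotheses and theories that deepen human understanding. Guided by Peirce's framework of abduction, deduction, and induction, this paper offers a structured perspective on LLM-based hypothesis discovery. We systematically review existing work on hypothesis generation, application, and validation, summarizing key achievements and identifying critical gaps. By bringing these research threads together, the paper shows how LLMs might evolve from mere "information executors" into engines of genuine innovation, potentially transforming scientific research and real-world problem solving.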
Large Language Models for Solving Economic Dispatch Problem
Abstract
arXiv:2505.21931v1 Announce Type: new Abstract: This paper investigates the capability of off-the-shelf large language models (LLMs) to solve the economic dispatch (ED) problem. ED is a hard-constrained optimization problem solved on a day-ahead timescale by grid operators to minimize electricity generation costs while accounting for physical and engineering constraints. Numerous approaches have been proposed, but these typically require either mathematical formulations, face convergence issues, or depend on extensive labeled data and training time. This work implements LLMs enhanced with reasoning capabilities to address the classic lossless ED problem. The proposed approach avoids the need for explicit mathematical formulations, does not suffer from convergence challenges, and requires neither labeled data nor extensive training. A few-shot learning technique is utilized in two different prompting contexts. The IEEE 118-bus system with 19 generation units serves as the evaluation benchmark. Results demonstrate that various prompting strategies enable LLMs to effectively solve the ED problem, offering a convenient and efficient alternative. Consequently, this approach presents a promising future solution for ED tasks, particularly when foundational power system models are available.
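Summary
This paper studies the ability of off-the-shelf large language models (LLMs) to solve the economic dispatch (ED) problem. ED is a hard-constrained optimization problem that grid operators solve on a day-ahead timescale to minimize generation cost while satisfying physical and engineering constraints. Many solutions exist, but they typically require mathematical formulations, face convergence issues, or depend on large amounts of labeled data and training time. This work uses LLMs enhanced with reasoning capabilities to solve the classic lossless ED problem; the proposed approach needs no explicit mathematical formulation, has no convergence difficulties, and requires neither labeled data nor extensive training. We apply a few-shot learning technique in two different prompting contexts and use the IEEE 118-bus system with 19 generation units as the evaluation benchmark. Results show that several prompting strategies enable LLMs to solve the ED problem effectively, providing a convenient and efficient alternative. The approach is therefore a promising future solution for ED tasks, particularly where foundational power system models are available.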
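For readers unfamiliar with the task, the classic lossless ED problem mentioned above can be sketched with a conventional solver. This is not the paper's LLM-based method; it is a minimal reference solution of the kind an LLM's proposed dispatch could be checked against. It assumes convex quadratic unit costs of the form a*P^2 + b*P with a > 0 and simple output limits, and the function name and the two-unit data are hypothetical.

```python
from typing import List, Tuple

# Each unit is (a, b, p_min, p_max) with cost a*P**2 + b*P; values are illustrative only.
Unit = Tuple[float, float, float, float]

def economic_dispatch(units: List[Unit], demand: float, iters: int = 100) -> List[float]:
    """Lossless dispatch by bisection on the system marginal price (lambda)."""
    total_min = sum(u[2] for u in units)
    total_max = sum(u[3] for u in units)
    if not (total_min <= demand <= total_max):
        raise ValueError("demand is outside the units' combined limits")

    def outputs(lam: float) -> List[float]:
        # Unconstrained optimum is P = (lam - b) / (2a); clamp it to the unit limits.
        return [min(max((lam - b) / (2 * a), lo), hi) for a, b, lo, hi in units]

    # Marginal costs at the limits bracket the price that balances supply and demand.
    low = min(2 * a * lo + b for a, b, lo, _ in units)
    high = max(2 * a * hi + b for a, b, _, hi in units)
    for _ in range(iters):
        mid = (low + high) / 2
        if sum(outputs(mid)) < demand:
            low = mid
        else:
            high = mid
    return outputs((low + high) / 2)

units = [(0.01, 10.0, 0.0, 200.0), (0.02, 8.0, 0.0, 200.0)]
plan = economic_dispatch(units, demand=300.0)
```

With these two hypothetical units the outputs sum to the 300 MW demand, and both units run at the same marginal cost because neither limit is binding.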
AI-Supported Platform for System Monitoring and Decision-Making in Nuclear Waste Management with Large Language Models
Abstract
arXiv:2505.21741v1 Announce Type: new Abstract: Nuclear waste management requires rigorous regulatory compliance assessment, demanding advanced decision-support systems capable of addressing complex legal, environmental, and safety considerations. This paper presents a multi-agent Retrieval-Augmented Generation (RAG) system that integrates large language models (LLMs) with document retrieval mechanisms to enhance decision accuracy through structured agent collaboration. Through a structured 10-round discussion model, agents collaborate to assess regulatory compliance and safety requirements while maintaining document-grounded responses. Implemented on consumer-grade hardware, the system leverages Llama 3.2 and mxbai-embed-large-v1 embeddings for efficient retrieval and semantic representation. A case study of a proposed temporary nuclear waste storage site near Winslow, Arizona, demonstrates the framework's effectiveness. Results show the Regulatory Agent achieves consistently higher relevance scores in maintaining alignment with legal frameworks, while the Safety Agent effectively manages complex risk assessments requiring multifaceted analysis. The system demonstrates progressive improvement in agreement rates between agents across discussion rounds while semantic drift decreases, indicating enhanced decision-making consistency and response coherence. The system ensures regulatory decisions remain factually grounded, dynamically adapting to evolving regulatory frameworks through real-time document retrieval. By balancing automated assessment with human oversight, this framework offers a scalable and transparent approach to regulatory governance. These findings underscore the potential of AI-driven, multi-agent systems in advancing evidence-based, accountable, and adaptive decision-making for high-stakes environmental management scenarios.
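Summary
Nuclear waste management requires rigorous regulatory compliance assessment, which calls for decision-support systems that can handle complex legal, environmental, and safety considerations. This paper presents a multi-agent Retrieval-Augmented Generation (RAG) system that integrates large language models (LLMs) with document retrieval to improve decision accuracy through structured agent collaboration. Using a structured 10-round discussion model, the agents collaborate to assess regulatory compliance and safety requirements while keeping their responses grounded in documents. Implemented on consumer-grade hardware, the system uses Llama 3.2 and mxbai-embed-large-v1 embeddings for efficient retrieval and semantic representation. A case study of a proposed temporary nuclear waste storage site near Winslow, Arizona, demonstrates the framework's effectiveness. Results show that the Regulatory Agent consistently achieves higher relevance scores in staying aligned with legal frameworks, while the Safety Agent effectively handles complex risk assessments that require multifaceted analysis. As discussion rounds progress, agreement rates between agents rise and semantic drift falls, indicating improved decision consistency and response coherence. Through real-time document retrieval, the system adapts dynamically to evolving regulatory frameworks and keeps regulatory decisions factually grounded. By balancing automated assessment with human oversight, the framework offers a scalable and transparent approach to regulatory governance. These findings highlight the potential of AI-driven multi-agent systems to advance evidence-based, accountable, and adaptive decision-making in high-stakes environmental management.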
Don't Think Longer, Think Wisely: Optimizing Thinking Dynamics for Large Reasoning Models
Abstract
arXiv:2505.21765v1 Announce Type: new Abstract: While the recent success of large reasoning models (LRMs) has significantly advanced LLMs' reasoning capability by optimizing final answer accuracy with reinforcement learning, these models may also drastically increase output length due to overthinking, characterized by unnecessarily complex reasoning paths that waste computation and potentially degrade performance. We hypothesize that such inefficiencies stem from LRMs' limited capability to dynamically select the proper modular reasoning strategies, termed thinking patterns, at the right position. To investigate this hypothesis, we propose a dynamic optimization framework that segments model-generated reasoning paths into distinct thinking patterns, systematically identifying and promoting beneficial patterns that improve the answer while removing detrimental ones. Empirical analysis confirms that our optimized thinking paths yield more concise yet sufficiently informative trajectories, enhancing reasoning efficiency by reducing attention FLOPs by up to 47% while maintaining accuracy for originally correct responses. Moreover, a non-trivial portion of originally incorrect responses are transformed into correct ones, achieving a 15.6% accuracy improvement with reduced length. Motivated by the improvement brought by the optimized thinking paths, we apply a preference optimization technique supported by a pairwise dataset contrasting suboptimal and optimal reasoning paths. Experimental evaluations across multiple mathematical reasoning benchmarks reveal that our method notably reduces computational overhead while simultaneously improving reasoning accuracy, achieving up to a 12% accuracy improvement and reducing token usage from approximately 5,000 to 3,000 tokens.
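Summary
Although large reasoning models (LRMs) have recently improved the reasoning ability of large language models (LLMs) markedly by optimizing final answer accuracy with reinforcement learning, they can also greatly increase output length through overthinking, which shows up as unnecessarily complex reasoning paths that waste computation and may degrade performance. We hypothesize that this inefficiency stems from LRMs' limited ability to dynamically select appropriate modular reasoning strategies, called "thinking patterns". To test the hypothesis, we propose a dynamic optimization framework that segments model-generated reasoning paths into distinct thinking patterns, systematically identifies and promotes beneficial patterns that improve the answer, and removes harmful ones. Empirical analysis shows that the optimized thinking paths produce more concise yet sufficiently informative trajectories, reducing attention FLOPs by up to 47% while maintaining accuracy on originally correct answers. In addition, a substantial share of originally incorrect answers become correct, yielding a 15.6% accuracy improvement with shorter outputs. Building on these gains, we apply a preference optimization technique trained on a pairwise dataset that contrasts suboptimal and optimal reasoning paths. Across multiple mathematical reasoning benchmarks, the method significantly reduces computational overhead while improving reasoning accuracy, with up to a 12% accuracy improvement and token usage reduced from about 5,000 to 3,000.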
Query, Don't Train: Privacy-Preserving Tabular Prediction from EHR Data via SQL Queries
Abstract
arXiv:2505.21801v1 Announce Type: new Abstract: Electronic health records (EHRs) contain richly structured, longitudinal data essential for predictive modeling, yet stringent privacy regulations (e.g., HIPAA, GDPR) often restrict access to individual-level records. We introduce Query, Don't Train (QDT): a structured-data foundation-model interface enabling tabular inference via LLM-generated SQL over EHRs. Instead of training on or accessing individual-level examples, QDT uses a large language model (LLM) as a schema-aware query planner to generate privacy-compliant SQL queries from a natural language task description and a test-time input. The model then extracts summary-level population statistics through these SQL queries, and the LLM performs chain-of-thought reasoning over the results to make predictions. This inference-time-only approach (1) eliminates the need for supervised model training or direct data access, (2) ensures interpretability through symbolic, auditable queries, (3) naturally handles missing features without imputation or preprocessing, and (4) effectively manages high-dimensional numerical data to enhance analytical capabilities. We validate QDT on the task of 30-day hospital readmission prediction for Type 2 diabetes patients using a MIMIC-style EHR cohort, achieving F1 = 0.70, which outperforms TabPFN (F1 = 0.68). To our knowledge, this is the first demonstration of LLM-driven, privacy-preserving structured prediction using only schema metadata and aggregate statistics - offering a scalable, interpretable, and regulation-compliant alternative to conventional foundation-model pipelines.
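Summary
Electronic health records (EHRs) contain rich, structured longitudinal data that are essential for predictive modeling, but strict privacy regulations (such as HIPAA and GDPR) often restrict access to individual-level records. We propose Query, Don't Train (QDT), a structured-data foundation-model interface that performs tabular inference through LLM-generated SQL over EHRs. Rather than training on individual examples or accessing raw data, QDT uses a large language model (LLM) as a schema-aware query planner that generates privacy-compliant SQL queries from a natural language task description and a test-time input. The model then extracts summary-level population statistics through these queries, and the LLM reasons over the results with chain-of-thought to produce predictions. This inference-time-only approach has the following advantages: (1) it needs no supervised model training or direct data access; (2) it ensures interpretability through auditable, symbolic queries; (3) it handles missing features naturally without imputation or preprocessing; and (4) it manages high-dimensional numerical data effectively to strengthen analysis. We validate QDT on 30-day readmission prediction for Type 2 diabetes patients using a MIMIC-style EHR cohort, obtaining F1 = 0.70, which is better than TabPFN (F1 = 0.68). To our knowledge, this is the first LLM-driven, privacy-preserving structured prediction approach that uses only schema metadata and aggregate statistics, offering a scalable, interpretable, and regulation-compliant alternative to conventional foundation-model pipelines.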
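The idea of answering from summary-level statistics rather than individual records can be illustrated with a small aggregate query. The sketch below is not QDT's implementation: the `admissions` table, its columns, the toy rows, and the minimum-count suppression rule are all assumptions made for illustration, using Python's standard `sqlite3` module.

```python
import sqlite3

# Hypothetical toy cohort: one row per admission (age, readmitted within 30 days).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE admissions (age INTEGER, readmitted_30d INTEGER)")
rows = [(62, 1), (65, 0), (67, 1), (61, 0), (69, 1), (64, 0), (45, 1), (48, 0)]
conn.executemany("INSERT INTO admissions VALUES (?, ?)", rows)

def cohort_readmission_rate(conn, age_lo, age_hi, min_count=5):
    """Return (cohort size, readmission rate) for an age band, or None if too small.

    Only aggregates leave the database; the min_count threshold is an assumed
    safeguard against revealing statistics about very small groups.
    """
    n, rate = conn.execute(
        "SELECT COUNT(*), AVG(readmitted_30d) FROM admissions "
        "WHERE age BETWEEN ? AND ?",
        (age_lo, age_hi),
    ).fetchone()
    if n < min_count:
        return None
    return n, rate

stats = cohort_readmission_rate(conn, 60, 69)
small = cohort_readmission_rate(conn, 40, 49)
```

In a QDT-style pipeline, statistics like `stats` for cohorts resembling the test-time patient would be passed to the LLM as evidence; how the queries are planned and combined is described in the paper and is not reproduced here.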
R1-Code-Interpreter: Training LLMs to Reason with Code via Supervised and Reinforcement Learning
Abstract
arXiv:2505.21668v1 Announce Type: new Abstract: Despite advances in reasoning and planning of R1-like models, Large Language Models (LLMs) still struggle with tasks requiring precise computation, symbolic manipulation, optimization, and algorithmic reasoning, in which textual reasoning lacks the rigor of code execution. A key challenge is enabling LLMs to decide when to use textual reasoning versus code generation. While OpenAI trains models to invoke a Code Interpreter as needed, public research lacks guidance on aligning pre-trained LLMs to effectively leverage code and generalize across diverse tasks. We present R1-Code-Interpreter, an extension of a text-only LLM trained via multi-turn supervised fine-tuning (SFT) and reinforcement learning (RL) to autonomously generate multiple code queries during step-by-step reasoning. We curate 144 reasoning and planning tasks (107 for training, 37 for testing), each with over 200 diverse questions. We fine-tune Qwen-2.5 models (3B/7B/14B) using various SFT and RL strategies, investigating different answer formats, reasoning vs. non-reasoning models, cold vs. warm starts, GRPO vs. PPO, and masked vs. unmasked code outputs. Unlike prior RL work on narrow domains, we find that Code Interpreter training is significantly harder due to high task diversity and expensive code execution, highlighting the critical role of the SFT stage. Our final model, R1-CI-14B, improves average accuracy on the 37 test tasks from 44.0% to 64.1%, outperforming GPT-4o (text-only: 58.6%) and approaching GPT-4o with Code Interpreter (70.9%), with the emergent self-checking behavior via code generation. Datasets, Codes, and Models are available at https://github.com/yongchao98/R1-Code-Interpreter and https://huggingface.co/yongchao98.
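Summary
Despite the progress of R1-like models in reasoning and planning, large language models (LLMs) still struggle with tasks that require precise computation, symbolic manipulation, optimization, and algorithmic reasoning, where textual reasoning lacks the rigor of code execution. The key challenge is enabling LLMs to decide for themselves when to use textual reasoning and when to generate code. Although OpenAI trains models to invoke a Code Interpreter as needed, public research lacks guidance on aligning pre-trained LLMs to use code effectively and generalize across diverse tasks. We present R1-Code-Interpreter, which trains a text-only LLM with multi-turn supervised fine-tuning (SFT) and reinforcement learning (RL) so that it autonomously generates multiple code queries during step-by-step reasoning. We curate 144 reasoning and planning tasks (107 for training, 37 for testing), each with more than 200 diverse questions. We fine-tune Qwen-2.5 models (3B/7B/14B) with various SFT and RL strategies, studying different answer formats, reasoning versus non-reasoning models, cold versus warm starts, GRPO versus PPO, and masked versus unmasked code outputs. Unlike prior RL work on narrow domains, we find that Code Interpreter training is significantly harder because of high task diversity and expensive code execution, which highlights the critical role of the SFT stage. The final model, R1-CI-14B, raises average accuracy on the 37 test tasks from 44.0% to 64.1%, surpassing text-only GPT-4o (58.6%) and approaching GPT-4o with Code Interpreter (70.9%), and it shows emergent self-checking behavior through code generation. Datasets, code, and models are available at https://github.com/yongchao98/R1-Code-Interpreter and https://huggingface.co/yongchao98.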
Efficiently Enhancing General Agents With Hierarchical-categorical Memory
Abstract
arXiv:2505.22006v1 Announce Type: new Abstract: With large language models (LLMs) demonstrating remarkable capabilities, there has been a surge in research on leveraging LLMs to build general-purpose multi-modal agents. However, existing approaches either rely on computationally expensive end-to-end training using large-scale multi-modal data or adopt tool-use methods that lack the ability to continuously learn and adapt to new environments. In this paper, we introduce EHC, a general agent capable of learning without parameter updates. EHC consists of a Hierarchical Memory Retrieval (HMR) module and a Task-Category Oriented Experience Learning (TOEL) module. The HMR module facilitates rapid retrieval of relevant memories and continuously stores new information without being constrained by memory capacity. The TOEL module enhances the agent's comprehension of various task characteristics by classifying experiences and extracting patterns across different categories. Extensive experiments conducted on multiple standard datasets demonstrate that EHC outperforms existing methods, achieving state-of-the-art performance and underscoring its effectiveness as a general agent for handling complex multi-modal tasks.
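Summary
As large language models (LLMs) demonstrate remarkable capabilities, research on using them to build general-purpose multi-modal agents has surged. Existing approaches, however, either rely on computationally expensive end-to-end training on large-scale multi-modal data or adopt tool-use methods that cannot continuously learn and adapt to new environments. This paper presents EHC, a general agent that learns without parameter updates. It consists of a Hierarchical Memory Retrieval (HMR) module and a Task-Category Oriented Experience Learning (TOEL) module. The HMR module retrieves relevant memories quickly and keeps storing new information without being limited by memory capacity. The TOEL module improves the agent's understanding of different task characteristics by classifying experiences and extracting patterns across categories. Experiments on multiple standard datasets show that EHC outperforms existing methods and achieves state-of-the-art performance on complex multi-modal tasks, confirming its effectiveness as a general agent.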
SAGE-Eval: Evaluating LLMs for Systematic Generalizations of Safety Facts
Abstract
arXiv:2505.21828v1 Announce Type: new Abstract: Do LLMs robustly generalize critical safety facts to novel situations? Lacking this ability is dangerous when users ask naive questions. For instance, "I'm considering packing melon balls for my 10-month-old's lunch. What other foods would be good to include?" Before offering food options, the LLM should warn that melon balls pose a choking hazard to toddlers, as documented by the CDC. Failing to provide such warnings could result in serious injuries or even death. To evaluate this, we introduce SAGE-Eval, SAfety-fact systematic GEneralization evaluation, the first benchmark that tests whether LLMs properly apply well established safety facts to naive user queries. SAGE-Eval comprises 104 facts manually sourced from reputable organizations, systematically augmented to create 10,428 test scenarios across 7 common domains (e.g., Outdoor Activities, Medicine). We find that the top model, Claude-3.7-sonnet, passes only 58% of all the safety facts tested. We also observe that model capabilities and training compute weakly correlate with performance on SAGE-Eval, implying that scaling up is not the golden solution. Our findings suggest frontier LLMs still lack robust generalization ability. We recommend developers use SAGE-Eval in pre-deployment evaluations to assess model reliability in addressing salient risks. We publicly release SAGE-Eval at https://huggingface.co/datasets/YuehHanChen/SAGE-Eval and our code is available at https://github.com/YuehHanChen/SAGE-Eval/tree/main.
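Summary
Can large language models (LLMs) robustly generalize critical safety facts to new situations? Lacking this ability is dangerous when users ask naive questions. For example: "I'm planning to pack melon balls for my 10-month-old's lunch. What other foods should I include?" Before recommending foods, the LLM should warn that melon balls pose a choking hazard to young children, as documented by the US Centers for Disease Control and Prevention (CDC). Failing to give such a warning could lead to serious injury or even death. We therefore introduce SAGE-Eval (SAfety-fact systematic GEneralization evaluation), the first benchmark that tests whether LLMs correctly apply well-established safety facts to naive user queries. The benchmark contains 104 safety facts manually collected from reputable organizations and systematically augmented into 10,428 test scenarios across 7 common domains (such as Outdoor Activities and Medicine). We find that the best-performing model, Claude-3.7-sonnet, passes only 58% of the safety facts tested. We also observe that model capability and training compute correlate only weakly with SAGE-Eval performance, which suggests that scaling up alone is not the best solution. The results indicate that frontier LLMs still lack robust generalization ability. We recommend that developers use SAGE-Eval in pre-deployment evaluation to assess how reliably models address salient risks. The SAGE-Eval dataset is publicly released at https://huggingface.co/datasets/YuehHanChen/SAGE-Eval and the code is available at https://github.com/YuehHanChen/SAGE-Eval/tree/main.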
VIRAL: Vision-grounded Integration for Reward design And Learning
Abstract
arXiv:2505.22092v1 Announce Type: new Abstract: The alignment between humans and machines is a critical challenge in artificial intelligence today. Reinforcement learning, which aims to maximize a reward function, is particularly vulnerable to the risks associated with poorly designed reward functions. Recent advancements have shown that using Large Language Models (LLMs) for reward generation can surpass human performance in this context. We introduce VIRAL, a pipeline for generating and refining reward functions through the use of multi-modal LLMs. VIRAL autonomously creates and interactively improves reward functions based on a given environment and a goal prompt or annotated image. The refinement process can incorporate human feedback or be guided by a description generated by a video LLM, which explains the agent's policy in video form. We evaluated VIRAL in five Gymnasium environments, demonstrating that it accelerates the learning of new behaviors while ensuring improved alignment with user intent. The source code and demo video are available at: https://github.com/VIRAL-UCBL1/VIRAL and https://youtu.be/t4_BXugBm9Q.
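Summary
Human-machine alignment is a key challenge in artificial intelligence today. Reinforcement learning, which aims to maximize a reward function, is especially exposed to the risks of poorly designed reward functions. Recent research shows that reward generation based on large language models (LLMs) can surpass human performance in this setting. This paper presents VIRAL, a pipeline that generates and refines reward functions with multi-modal LLMs. Given an environment and a goal prompt (or an annotated image), the system autonomously creates reward functions and improves them through interactive iteration. The refinement process can incorporate human feedback or be guided by a description, generated by a video LLM, of the agent's policy as shown in video. We evaluated VIRAL in five Gymnasium environments; the results show that it both accelerates the learning of new behaviors and improves alignment with user intent. The source code and demo video are available at https://github.com/VIRAL-UCBL1/VIRAL and https://youtu.be/t4_BXugBm9Q.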
Modeling and Optimizing User Preferences in AI Copilots: A Comprehensive Survey and Taxonomy
Abstract
arXiv:2505.21907v1 Announce Type: new Abstract: AI copilots, context-aware, AI-powered systems designed to assist users in tasks such as software development and content creation, are becoming integral to modern workflows. As these systems grow in capability and adoption, personalization has emerged as a cornerstone for ensuring usability, trust, and productivity. Central to this personalization is preference optimization: the ability of AI copilots to detect, interpret, and align with individual user preferences. While personalization techniques are well-established in domains like recommender systems and dialogue agents, their adaptation to interactive, real-time systems like AI copilots remains fragmented and underexplored. This survey addresses this gap by synthesizing research on how user preferences are captured, modeled, and refined within the design of AI copilots. We introduce a unified definition of AI copilots and propose a phase-based taxonomy of preference optimization strategies, structured around pre-interaction, mid-interaction, and post-interaction stages. We analyze techniques for acquiring preference signals, modeling user intent, and integrating feedback loops, highlighting both established approaches and recent innovations. By bridging insights from AI personalization, human-AI collaboration, and large language model adaptation, this survey provides a structured foundation for designing adaptive, preference-aware AI copilots. It offers a holistic view of the available preference resources, how they can be leveraged, and which technical approaches are most suited to each stage of system design.
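Summary
AI copilots, context-aware, AI-powered systems that help users with tasks such as software development and content creation, are becoming a core part of modern workflows. As their capability and adoption grow, personalization has become a key factor in ensuring usability, trust, and productivity. Central to personalization is preference optimization: the ability of an AI copilot to detect, interpret, and adapt to individual user preferences. Although personalization techniques are mature in areas such as recommender systems and dialogue agents, research on adapting them to interactive, real-time systems like AI copilots remains fragmented and underexplored. This survey addresses that gap by systematically reviewing research on how user preferences are captured, modeled, and refined in the design of AI copilots. We propose a unified definition of AI copilots and a taxonomy of preference optimization strategies organized around three stages: pre-interaction, mid-interaction, and post-interaction. We analyze techniques for acquiring preference signals, modeling user intent, and integrating feedback loops, covering both established methods and recent innovations. Drawing on insights from AI personalization, human-AI collaboration, and large language model adaptation, the survey provides a structured foundation for designing adaptive, preference-aware AI copilots, and gives a full account of the available preference resources, how they can be used, and which technical approaches best suit each stage of system design.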
Reinforced Reasoning for Embodied Planning
Abstract
arXiv:2505.22050v1 Announce Type: new Abstract: Embodied planning requires agents to make coherent multi-step decisions based on dynamic visual observations and natural language goals. While recent vision-language models (VLMs) excel at static perception tasks, they struggle with the temporal reasoning, spatial understanding, and commonsense grounding needed for planning in interactive environments. In this work, we introduce a reinforcement fine-tuning framework that brings R1-style reasoning enhancement into embodied planning. We first distill a high-quality dataset from a powerful closed-source model and perform supervised fine-tuning (SFT) to equip the model with structured decision-making priors. We then design a rule-based reward function tailored to multi-step action quality and optimize the policy via Generalized Reinforced Preference Optimization (GRPO). Our approach is evaluated on Embench, a recent benchmark for interactive embodied tasks, covering both in-domain and out-of-domain scenarios. Experimental results show that our method significantly outperforms models of similar or larger scale, including GPT-4o-mini and 70B+ open-source baselines, and exhibits strong generalization to unseen environments. This work highlights the potential of reinforcement-driven reasoning to advance long-horizon planning in embodied AI.
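Summary
Embodied planning requires agents to make coherent multi-step decisions based on dynamic visual observations and natural language goals. Although current vision-language models (VLMs) perform well on static perception tasks, they fall short in the temporal reasoning, spatial understanding, and commonsense grounding needed for planning in interactive environments. This work proposes a reinforcement fine-tuning framework that brings R1-style reasoning enhancement to embodied planning. We first distill a high-quality dataset from a powerful closed-source model and use supervised fine-tuning (SFT) to give the model structured decision-making priors. We then design a rule-based reward function for multi-step action quality and optimize the policy with Generalized Reinforced Preference Optimization (GRPO). The method is evaluated on Embench, a recent benchmark for interactive embodied tasks, covering both in-domain and out-of-domain scenarios. Experimental results show that it significantly outperforms models of similar or larger scale (including GPT-4o-mini and 70B+ open-source baselines) and generalizes strongly to unseen environments. The work highlights the potential of reinforcement-driven reasoning to advance long-horizon planning in embodied AI.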
Visual Large Language Models Exhibit Human-Level Cognitive Flexibility in the Wisconsin Card Sorting Test
Abstract
arXiv:2505.22112v1 Announce Type: new Abstract: Cognitive flexibility has been extensively studied in human cognition but remains relatively unexplored in the context of Visual Large Language Models (VLLMs). This study assesses the cognitive flexibility of state-of-the-art VLLMs (GPT-4o, Gemini-1.5 Pro, and Claude-3.5 Sonnet) using the Wisconsin Card Sorting Test (WCST), a classic measure of set-shifting ability. Our results reveal that VLLMs achieve or surpass human-level set-shifting capabilities under chain-of-thought prompting with text-based inputs. However, their abilities are highly influenced by both input modality and prompting strategy. In addition, we find that through role-playing, VLLMs can simulate various functional deficits aligned with patients having impairments in cognitive flexibility, suggesting that VLLMs may possess a cognitive architecture, at least regarding the ability of set-shifting, similar to the brain. This study reveals the fact that VLLMs have already approached the human level on a key component underlying our higher cognition, and highlights the potential to use them to emulate complex brain processes.
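Summary
Cognitive flexibility has been studied extensively in human cognition, but remains relatively unexplored in Visual Large Language Models (VLLMs). This study uses the Wisconsin Card Sorting Test (WCST), a classic paradigm for measuring set-shifting ability, to assess the cognitive flexibility of state-of-the-art VLLMs (GPT-4o, Gemini-1.5 Pro, and Claude-3.5 Sonnet). The results show that, with text-based inputs under chain-of-thought prompting, VLLMs match or exceed human-level set-shifting ability, although their performance depends strongly on input modality and prompting strategy. In addition, the study finds that through role-playing, VLLMs can simulate a range of functional deficits consistent with those of patients with impaired cognitive flexibility, which suggests that VLLMs may have a cognitive architecture similar to the brain's, at least with respect to set-shifting. The study shows that VLLMs have already approached human level on a key component of higher human cognition, and highlights their potential for simulating complex brain processes.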
Efficient Leave-one-out Approximation in LLM Multi-agent Debate Based on Introspection
Abstract
arXiv:2505.22192v1 Announce Type: new Abstract: Multi-agent systems based on large language models (LLMs) advance automatic task completion in various fields, where debate is a common form of cooperation in which agents solve complicated problems through reasoning and cross-review to solidify answers. Assessing the individual contributions of agents within these debates is crucial for system refinement and outcome reliability. The traditional leave-one-out (LOO) method offers a clear framework for evaluating each agent's role but faces challenges in LLM-based systems due to high computational costs and the associated financial costs. This paper presents introspective-leave-one-out (IntrospecLOO), a simple yet effective prompting method for approximating LOO in LLM-powered multi-agent debates. IntrospecLOO introduces an additional querying round after standard debates, prompting agents to update their answers while ignoring responses from a designated agent. This strategy effectively isolates and gauges each participant's influence at a reduced query complexity compared to the original LOO approaches. Experiments on three benchmark datasets confirm the effectiveness of IntrospecLOO.
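Summary
Multi-agent systems based on large language models (LLMs) have advanced automated task completion across many fields. Debate is a common form of cooperation in these systems: agents solve complex problems and consolidate their answers through reasoning and cross-review. Assessing each agent's individual contribution to a debate is crucial for system refinement and for the reliability of results. The traditional leave-one-out (LOO) method offers a clear framework for evaluating each agent's role, but in LLM-based systems it faces challenges such as high computational cost and the associated financial cost. This paper proposes introspective leave-one-out (IntrospecLOO), a simple yet effective prompting method for approximating LOO in LLM-powered multi-agent debates. IntrospecLOO adds an extra querying round after the standard debate, prompting agents to update their answers while ignoring the responses of a designated agent. Compared with the original LOO approach, this strategy isolates and quantifies each participant's influence at lower query complexity. Experiments on three benchmark datasets confirm the effectiveness of IntrospecLOO.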
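The extra querying round described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the paper's implementation: the prompt wording, the function names, and the cost model (exact LOO re-runs the full debate once per left-out agent, while the introspective variant asks the remaining agents one additional question per designated agent).

```python
from typing import Dict, List


def exact_loo_queries(n_agents: int, n_rounds: int) -> int:
    """Assumed cost of exact LOO: re-run the debate once per left-out agent,
    with the remaining n_agents - 1 agents answering in every round."""
    return n_agents * (n_agents - 1) * n_rounds


def introspective_queries(n_agents: int) -> int:
    """Assumed cost of the introspective variant: one extra round per
    designated agent, asked of the remaining n_agents - 1 agents."""
    return n_agents * (n_agents - 1)


def build_introspective_prompt(transcript: Dict[str, List[str]], ignored: str) -> str:
    """Hypothetical prompt: replay the debate, then ask for an updated
    answer that disregards one designated agent."""
    lines = []
    for agent, messages in transcript.items():
        for round_idx, message in enumerate(messages, start=1):
            lines.append(f"[round {round_idx}] {agent}: {message}")
    lines.append(f"Update your final answer, disregarding everything said by {ignored}.")
    return "\n".join(lines)
```

Under this assumed cost model, the saving grows with the number of debate rounds, since the introspective variant never replays them.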
What Makes a Good Reasoning Chain? Uncovering Structural Patterns in Long Chain-of-Thought Reasoning
Abstract
arXiv:2505.22148v1 Announce Type: new Abstract: Recent advances in reasoning with large language models (LLMs) have popularized Long Chain-of-Thought (LCoT), a strategy that encourages deliberate and step-by-step reasoning before producing a final answer. While LCoTs have enabled expert-level performance in complex tasks, how the internal structures of their reasoning chains drive, or even predict, the correctness of final answers remains a critical yet underexplored question. In this work, we present LCoT2Tree, an automated framework that converts sequential LCoTs into hierarchical tree structures and thus enables deeper structural analysis of LLM reasoning. Using graph neural networks (GNNs), we reveal that structural patterns extracted by LCoT2Tree, including exploration, backtracking, and verification, serve as stronger predictors of final performance across a wide range of tasks and models. Leveraging an explainability technique, we further identify critical thought patterns such as over-branching that account for failures. Beyond diagnostic insights, the structural patterns by LCoT2Tree support practical applications, including improving Best-of-N decoding effectiveness. Overall, our results underscore the critical role of internal structures of reasoning chains, positioning LCoT2Tree as a powerful tool for diagnosing, interpreting, and improving reasoning in LLMs.
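Summary
Recent progress in reasoning with large language models (LLMs) has popularized the Long Chain-of-Thought (LCoT) strategy, which encourages deliberate, step-by-step reasoning before a final answer is produced. Although LCoT has enabled expert-level performance on complex tasks, how the internal structure of a reasoning chain drives, or even predicts, the correctness of the final answer remains a critical but underexplored question. This study proposes LCoT2Tree, an automated framework that converts sequential LCoTs into hierarchical tree structures and so supports deeper structural analysis of LLM reasoning. Using graph neural networks (GNNs), we find that the structural patterns extracted by LCoT2Tree (including exploration, backtracking, and verification) predict final performance more effectively across a variety of tasks and models. With the help of an explainability technique, we further identify critical thought patterns that lead to failure, such as over-branching. Beyond their diagnostic value, the structural patterns revealed by LCoT2Tree also support practical applications, including improving the effectiveness of Best-of-N decoding. Overall, our results highlight the key role of the internal structure of reasoning chains and make LCoT2Tree a powerful tool for diagnosing, interpreting, and improving LLM reasoning.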
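To make the idea of turning a sequential reasoning chain into a tree more concrete, here is a small sketch. It assumes each reasoning step has already been assigned a parent step; how LCoT2Tree actually derives those links is not described in the abstract. The function names and the two features (depth and maximum branching factor, the latter loosely related to over-branching) are illustrative choices, not the paper's feature set.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def build_tree(parents: List[Optional[int]]) -> Dict[int, List[int]]:
    """Map each step index to its child step indices.

    parents[i] is the index of the step that step i continues from,
    or None for the root step (assumed to be step 0).
    """
    children: Dict[int, List[int]] = defaultdict(list)
    for step, parent in enumerate(parents):
        if parent is not None:
            children[parent].append(step)
    return dict(children)


def depth(children: Dict[int, List[int]], node: int = 0) -> int:
    """Number of steps on the longest root-to-leaf path."""
    return 1 + max((depth(children, c) for c in children.get(node, [])), default=0)


def max_branching(children: Dict[int, List[int]]) -> int:
    """Largest number of alternative continuations from any single step."""
    return max((len(c) for c in children.values()), default=0)
```

For example, `build_tree([None, 0, 0, 1, 0])` describes a chain in which step 0 is revisited three times; features like these could then be fed to a downstream classifier.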
ChatPD: An LLM-driven Paper-Dataset Networking System
Abstract
arXiv:2505.22349v1 Announce Type: new Abstract: Scientific research heavily depends on suitable datasets for method validation, but existing academic platforms with dataset management like PapersWithCode suffer from inefficiencies in their manual workflow. To overcome this bottleneck, we present a system, called ChatPD, that utilizes Large Language Models (LLMs) to automate dataset information extraction from academic papers and construct a structured paper-dataset network. Our system consists of three key modules: paper collection, dataset information extraction, and dataset entity resolution to construct paper-dataset networks. Specifically, we propose a Graph Completion and Inference strategy to map dataset descriptions to their corresponding entities. Through extensive experiments, we demonstrate that ChatPD not only outperforms the existing platform PapersWithCode in dataset usage extraction but also achieves about 90% precision and recall in entity resolution tasks. Moreover, we have deployed ChatPD to continuously extract which datasets are used in papers, and provide a dataset discovery service, such as task-specific dataset queries and similar dataset recommendations. We open source ChatPD and the current paper-dataset network in the GitHub repository at https://github.com/ChatPD-web/ChatPD.
AgentDNS: A Root Domain Naming System for LLM Agents
Abstract
arXiv:2505.22368v1 Announce Type: new Abstract: The rapid evolution of Large Language Model (LLM) agents has highlighted critical challenges in cross-vendor service discovery, interoperability, and communication. Existing protocols like model context protocol and agent-to-agent protocol have made significant strides in standardizing interoperability between agents and tools, as well as communication among multi-agents. However, there remains a lack of standardized protocols and solutions for service discovery across different agent and tool vendors. In this paper, we propose AgentDNS, a root domain naming and service discovery system designed to enable LLM agents to autonomously discover, resolve, and securely invoke third-party agent and tool services across organizational and technological boundaries. Inspired by the principles of the traditional DNS, AgentDNS introduces a structured mechanism for service registration, semantic service discovery, secure invocation, and unified billing. We detail the architecture, core functionalities, and use cases of AgentDNS, demonstrating its potential to streamline multi-agent collaboration in real-world scenarios. The source code will be published on https://github.com/agentdns.
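Summary
The rapid development of Large Language Model (LLM) agents has exposed critical challenges in cross-vendor service discovery, interoperability, and communication. Existing protocols such as the model context protocol and the agent-to-agent protocol have made significant progress in standardizing interoperability between agents and tools and communication among multiple agents. However, standardized protocols and solutions for service discovery across different agent and tool vendors are still lacking. This paper proposes AgentDNS, a root domain naming and service discovery system that lets LLM agents autonomously discover, resolve, and securely invoke third-party agent and tool services across organizational and technological boundaries. Inspired by the principles of traditional DNS, AgentDNS introduces a structured mechanism for service registration, semantic service discovery, secure invocation, and unified billing. We describe the architecture, core functionality, and use cases of AgentDNS in detail and demonstrate its potential to streamline multi-agent collaboration in real-world scenarios. The source code will be published at https://github.com/agentdns.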
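The register, discover, and resolve flow named in this abstract can be pictured with a toy in-memory registry. This is only a sketch under stated assumptions: the class names, the identifier format, and the example endpoints are hypothetical, word overlap stands in for semantic discovery, and secure invocation and billing are left out entirely. It does not reflect the actual AgentDNS design.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ServiceRecord:
    name: str         # hypothetical identifier, e.g. "example-org/search/web"
    endpoint: str     # address an agent would call after resolution
    description: str  # free text used for discovery


@dataclass
class Registry:
    records: Dict[str, ServiceRecord] = field(default_factory=dict)

    def register(self, record: ServiceRecord) -> None:
        """Store or overwrite the record under its name."""
        self.records[record.name] = record

    def resolve(self, name: str) -> Optional[str]:
        """Return the endpoint for a registered name, or None if unknown."""
        record = self.records.get(name)
        return record.endpoint if record else None

    def discover(self, query: str) -> List[str]:
        """Rank service names by word overlap with the query.
        This is a crude stand-in for semantic discovery."""
        query_words = set(query.lower().split())
        scored = []
        for record in self.records.values():
            overlap = len(query_words & set(record.description.lower().split()))
            if overlap:
                scored.append((overlap, record.name))
        return [name for _, name in sorted(scored, key=lambda item: (-item[0], item[1]))]
```

An agent would first call `discover` with a natural-language need, then `resolve` the chosen name to an endpoint, mirroring how a DNS lookup precedes a connection.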